library(data.table)
library(tidyverse)
library(tidytext)
library(NLP)
library(openNLP) # needs Java JDK: https://www.oracle.com/java/technologies/downloads/
library(corpustools)
library(topicmodels)
library(ggwordcloud)
library(wordcloud)

Working with text data
SMS Spam Collection
We’ll work with a small SMS corpus (publicly available at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) containing 5,574 messages which have been labelled as either spam or legitimate (ham). Thus, it is an interesting data set for feature extraction and machine learning: How can we best predict whether a message is spam or not?
(NB: Many ham messages come from Singapore. If a certain word confuses you (such as la, mah or lor), it’s probably Singlish: https://en.wikipedia.org/wiki/Singlish_vocabulary)
Before we can predict anything, however, we’ll have to read in the data and learn how to handle text data in R.
smsspam <- fread(
"../data/smsspam.tsv", # path to file
quote = "", # don't treat quote characters as special
header = FALSE, # first row doesn't contain column names
col.names = c("y", "text"), # name the two columns
encoding = "UTF-8" # specify encoding (some problems remain)
)
# this would also work:
# test <- read_tsv(
# "data/smsspam.tsv",
# quote = "",
# col_names = c("y", "text")
# )
str(smsspam)
Classes 'data.table' and 'data.frame': 5574 obs. of 2 variables:
$ y : chr "ham" "ham" "spam" "ham" ...
$ text: chr "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
- attr(*, ".internal.selfref")=<externalptr>
Basic string manipulation
Here, we’ll take a look at common string functions, first in base R, then in the Tidyverse (which might be a little more intuitive, or, at the very least, a little more tidy).
In order not to be overwhelmed by the output of many functions, we’ll define two smaller objects to work with: texts, a character vector containing the first 10 texts from the SMS corpus, and text, the first of these texts.
texts <- head(smsspam$text, 10)
texts
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "Ok lar... Joking wif u oni..."
[3] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "U dun say so early hor... U c already then say..."
[5] "Nah I don't think he goes to usf, he lives around here though"
[6] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
[7] "Even my brother is not like to speak with me. They treat me like aids patent."
[8] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
[9] "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
[10] "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
text <- texts[2]
text
[1] "Ok lar... Joking wif u oni..."
Base R
Number of characters (length)
We’ve already used nchar() before. By default, this gives us the number of individual characters in a string (or character, but you can see why R’s terminology might be confusing here). As the function is vectorised, it can also be used to get the lengths of all strings in a vector:
nchar(text)
[1] 29
nchar(texts)
 [1] 111 29 155 49 61 147 77 160 157 154
Note that you can also use the optional parameter type to get the number of bytes needed to store the string:
"\u2640"
[1] "♀"
nchar("\u2640", type = "bytes")
[1] 3
Substrings
To get a substring (part of a string), you need to specify start and end position (remember that in R, you start counting at 1, not at 0):
substr(text, start = 11, stop = 16)
[1] "Joking"
You can also assign a new value to a substring, thereby changing the original string:
tmp <- text
substr(tmp, start = 11, stop = 16) <- "Kidding"
tmp
[1] "Ok lar... Kiddin wif u oni..."
What happens if the replacement string is longer or shorter than the part being replaced?
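A quick experiment (using a copy of our example string) answers this: substr() replacement never changes the length of the target string, so the replacement value is truncated or applied only partially.

```r
x <- "Ok lar... Joking wif u oni..."
substr(x, start = 11, stop = 16) <- "Kidding" # 7 characters into a 6-character slot
x # the replacement gets truncated
# [1] "Ok lar... Kiddin wif u oni..."

y <- "Ok lar... Joking wif u oni..."
substr(y, start = 11, stop = 16) <- "Ha" # 2 characters into a 6-character slot
y # only the first two positions are replaced
# [1] "Ok lar... Haking wif u oni..."
```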
substr() also works on vectors:
substr(texts, start = 1, stop = 10)
 [1] "Go until j" "Ok lar... " "Free entry" "U dun say " "Nah I don'"
[6] "FreeMsg He" "Even my br" "As per you" "WINNER!! A" "Had your m"
strsplit() can be used to, well, split strings at the position where a specified string (or regular expression) occurs.
Note that this function always yields a list:
strsplit(text, " ")
[[1]]
[1] "Ok" "lar..." "Joking" "wif" "u" "oni..."
You can use unlist() to simplify a list to a vector of all atomic elements contained in the list:
strsplit(text, " ") |> unlist()
[1] "Ok" "lar..." "Joking" "wif" "u" "oni..."
The list output makes more sense when applying strsplit() to a vector:
strsplit(texts[1:2], " ")
[[1]]
[1] "Go" "until" "jurong" "point," "crazy.." "Available"
[7] "only" "in" "bugis" "n" "great" "world"
[13] "la" "e" "buffet..." "Cine" "there" "got"
[19] "amore" "wat..."
[[2]]
[1] "Ok" "lar..." "Joking" "wif" "u" "oni..."
Concatenation and insertion
To join strings (or vectors of strings), you can use paste():
paste("sms", text, sep = " : ")
[1] "sms : Ok lar... Joking wif u oni..."
paste("sms", texts, sep = " : ")
 [1] "sms : Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "sms : Ok lar... Joking wif u oni..."
[3] "sms : Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "sms : U dun say so early hor... U c already then say..."
[5] "sms : Nah I don't think he goes to usf, he lives around here though"
[6] "sms : FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
[7] "sms : Even my brother is not like to speak with me. They treat me like aids patent."
[8] "sms : As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
[9] "sms : WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
[10] "sms : Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
paste(head(smsspam$y, 10), ": ", texts, sep = "")
 [1] "ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "ham: Ok lar... Joking wif u oni..."
[3] "spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "ham: U dun say so early hor... U c already then say..."
[5] "ham: Nah I don't think he goes to usf, he lives around here though"
[6] "spam: FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
[7] "ham: Even my brother is not like to speak with me. They treat me like aids patent."
[8] "ham: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
[9] "spam: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
[10] "spam: Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
To insert values at specific places in a string, you can use a wrapper for the C function sprintf(). In the following example, "%s: %s" is a string wherein %s functions as a string placeholder to be replaced by the value(s) of the corresponding variable:
sprintf("%s: %s", head(smsspam$y, 10), texts)
 [1] "ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "ham: Ok lar... Joking wif u oni..."
[3] "spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "ham: U dun say so early hor... U c already then say..."
[5] "ham: Nah I don't think he goes to usf, he lives around here though"
[6] "spam: FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
[7] "ham: Even my brother is not like to speak with me. They treat me like aids patent."
[8] "ham: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
[9] "spam: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
[10] "spam: Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
See the documentation (?sprintf) for details and other placeholders.
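For illustration (these particular examples are ours, not part of the corpus walkthrough), a few other common placeholders:

```r
sprintf("%d messages", 5574) # %d: integer
# [1] "5574 messages"
sprintf("%.2f", 1.5) # %f: floating point number, here rounded to 2 decimals
# [1] "1.50"
sprintf("%05d", 42) # zero-padded to a width of 5
# [1] "00042"
sprintf("%s costs %.2f", "1 SMS", 0.1) # placeholders are filled from left to right
# [1] "1 SMS costs 0.10"
```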
Case
Use tolower() and toupper() to convert a string to lower or to upper case, respectively.
tolower(text)
[1] "ok lar... joking wif u oni..."
toupper(text)
[1] "OK LAR... JOKING WIF U ONI..."
Searching with regular expressions
grep() returns a vector of the indices of strings in a vector that yielded a match for the given regular expression. Applying it to a single string is thus not very useful:
grep("[[:digit:]]", texts[1])
integer(0)
grep("[[:digit:]]", texts[3])
[1] 1
grep("[[:digit:]]", texts)
[1] 3 6 8 9 10
To get the strings containing matches, you can either use the returned indices to subset the original vector or just set value to TRUE:
# texts[grep("[[:digit:]]", texts)]
grep("[[:digit:]]", texts, value = TRUE)
[1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[2] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"
[3] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
[4] "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."
[5] "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
grepl() will return TRUE for each string in a vector that yielded a match:
grepl("[[:digit:]]", texts)
 [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE
As of R version 4.0, you can use raw strings. Although they are a little more awkward than in Python, this still makes it easier to escape special characters in more complicated regular expressions.
A raw string has an r in front of the opening (single or double) quotation mark and round brackets inside the quotation marks: r"(...)" (where ... stands for any character sequence).
grepl("\\{", "this string contains {curly} brackets") # regular string needs double backslash to escape
[1] TRUE
grepl(r"(\{)", "this string contains {curly} brackets") # raw string only needs a single backslash to escape
[1] TRUE
Replacing with regular expressions
gsub("[[:digit:]]", 0, texts[1])
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
gsub("[[:digit:]]", 0, texts[3])
[1] "Free entry in 0 a wkly comp to win FA Cup final tkts 00st May 0000. Text FA to 00000 to receive entry question(std txt rate)T&C's apply 00000000000over00's"
gsub("[[:digit:]]", 0, texts)
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "Ok lar... Joking wif u oni..."
[3] "Free entry in 0 a wkly comp to win FA Cup final tkts 00st May 0000. Text FA to 00000 to receive entry question(std txt rate)T&C's apply 00000000000over00's"
[4] "U dun say so early hor... U c already then say..."
[5] "Nah I don't think he goes to usf, he lives around here though"
[6] "FreeMsg Hey there darling it's been 0 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £0.00 to rcv"
[7] "Even my brother is not like to speak with me. They treat me like aids patent."
[8] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *0 to copy your friends Callertune"
[9] "WINNER!! As a valued network customer you have been selected to receivea £000 prize reward! To claim call 00000000000. Claim code KL000. Valid 00 hours only."
[10] "Had your mobile 00 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 00000000000"
Tidyverse (stringr)
The main advantage of functions from stringr is probably that they are easier to find since they all start with str_.
stringr is built on top of stringi, a comprehensive package that contains many more string functions – stringr focusses on the most common tasks.
DataCamp offers a whole course on string manipulation with stringr: https://learn.datacamp.com/courses/string-manipulation-with-stringr-in-r
String length
str_length(text)
[1] 29
str_length(texts)
 [1] 111 29 155 49 61 147 77 160 157 154
One caveat: Unicode allows characters to be combined. For example, the umlaut ü can be thought of as a combination of u (U+0075) and a diacritic (U+0308):
weird_umlaut <- "\u0075\u0308" # Apple users tend to produce these ...
weird_umlaut # NB: will be displayed as two characters in HTML output
[1] "ü"
But there is also a single character which looks exactly the same:
regular_umlaut <- "\u00fc"
regular_umlaut
[1] "ü"
These two code sequences are called canonically equivalent. But although they look the same, they are not:
regular_umlaut == weird_umlaut
[1] FALSE
This may also result in unexpected string lengths:
umlauts <- c(regular_umlaut, weird_umlaut)
str_length(umlauts) # same for nchar()
[1] 1 2
str_count(umlauts) # what we'd expect
[1] 1 1
Luckily, there are different Unicode normalisation forms (see https://unicode.org/reports/tr15/) to handle canonical equivalence (and the weaker compatibility equivalence of characters which represent the same abstract character but are displayed differently, e.g. ² and 2).
NFC is the normalisation form for canonical decomposition, followed by canonical composition: If a combination of Unicode characters can be represented by a single Unicode character that looks the same, that’s the way to go.
When dealing with text data from multiple sources (e.g. text scraped from different web pages), you may want to perform Unicode normalisation to ensure you don’t run into problems later on.
Check if strings are in NFC form:
stringi::stri_trans_isnfc(umlauts)
[1] TRUE FALSE
NFC normalisation:
umlauts <- stringi::stri_trans_nfc(umlauts)
umlauts[1] == umlauts[2]
[1] TRUE
Substrings
str_sub() works just as substr(), but you can also use negative values to count from the end of the string.
str_sub(text, start = 1L, end = 16L)
[1] "Ok lar... Joking"
str_sub(text, start = -19L, end = -1L)
[1] "Joking wif u oni..."
Replacement works almost the same way:
tmp <- text
str_sub(tmp, start = 11L, end = 16L) <- "Kidding"
tmp
[1] "Ok lar... Kidding wif u oni..."
The difference is that the replacement string will always be inserted completely.
Splitting strings with str_split() works mostly the same as with strsplit(). You just get a few additional options, like simplify = TRUE to return a matrix instead of a list.
str_split(texts[1:2], " ")
[[1]]
[1] "Go" "until" "jurong" "point," "crazy.." "Available"
[7] "only" "in" "bugis" "n" "great" "world"
[13] "la" "e" "buffet..." "Cine" "there" "got"
[19] "amore" "wat..."
[[2]]
[1] "Ok" "lar..." "Joking" "wif" "u" "oni..."
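For example, with simplify = TRUE (a small illustration of our own), we get one row per input string, and rows for strings with fewer pieces are padded with empty strings:

```r
library(stringr)

m <- str_split(c("Ok lar...", "Joking wif u"), " ", simplify = TRUE)
dim(m) # one row per input string, one column per piece of the longest split
# [1] 2 3
m[1, ] # the shorter split is padded with ""
# [1] "Ok"     "lar..." ""
```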
Concatenation and insertion
To join strings, use str_c(). The default value of sep is an empty string.
str_c(head(smsspam$y, 3), ": ", texts[1:3])
[1] "ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "ham: Ok lar... Joking wif u oni..."
[3] "spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
You can “flatten” a vector of strings into a single string using str_flatten():
str_flatten(texts, "\n\n")
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n\nOk lar... Joking wif u oni...\n\nFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n\nU dun say so early hor... U c already then say...\n\nNah I don't think he goes to usf, he lives around here though\n\nFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv\n\nEven my brother is not like to speak with me. They treat me like aids patent.\n\nAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\n\nWINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n\nHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"
You can use str_glue() to insert variable values into strings:
str_glue("{head(smsspam$y, 3)}: {texts[1:3]}")
ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham: Ok lar... Joking wif u oni...
spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Named arguments are also possible:
str_glue("{label}: {text}",
label = head(smsspam$y, 3),
text = texts[1:3])
ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham: Ok lar... Joking wif u oni...
spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
Case
To convert a string to lower or upper case, use str_to_lower() or str_to_upper(). There are also functions for sentence and title case; see ?case.
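For instance (a quick illustration with our own input):

```r
library(stringr)

str_to_title("free entry in a wkly comp") # capitalises every word
# [1] "Free Entry In A Wkly Comp"
str_to_sentence("free entry in a wkly comp") # capitalises only the first word
# [1] "Free entry in a wkly comp"
```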
Removing whitespace
To remove whitespace from start and end of strings, use str_trim(). To also remove repeated whitespace inside of strings, use str_squish().
stupid_string <- " \t some words\t\tand\n stuff "
str_trim(stupid_string)
[1] "some words\t\tand\n stuff"
str_squish(stupid_string)
[1] "some words and stuff"
Searching with regular expressions
str_count() offers a nice way to count the number of occurrences of a specific pattern in a string.
How many vowels are there in text? (Linguistically naïve, but still …)
str_count(text, "[aeiouAEIOU]")
[1] 8
A very simple word count using a regular expression for alphanumericals (letters and digits):
str_count(texts, "[[:alnum:]]+") # simple word count
 [1] 20 6 33 11 14 36 16 26 26 29
str_detect() tells you if a string contains a given pattern:
str_detect(texts, "^[fF]ree") # does the string start with "Free" or "free"?
 [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
This is often useful to filter data sets:
smsspam |> filter(str_detect(text, "[0-9]{5,}")) # texts containing at least 5 consecutive digits
str_extract() extracts the first matching string:
str_extract(texts, "[[:alnum:]]+")
 [1] "Go" "Ok" "Free" "U" "Nah" "FreeMsg" "Even"
[8] "As" "WINNER" "Had"
str_extract_all() extracts all matching strings (and therefore returns a list):
str_extract_all(texts[1:3], "[[:alnum:]]+")
[[1]]
[1] "Go" "until" "jurong" "point" "crazy" "Available"
[7] "only" "in" "bugis" "n" "great" "world"
[13] "la" "e" "buffet" "Cine" "there" "got"
[19] "amore" "wat"
[[2]]
[1] "Ok" "lar" "Joking" "wif" "u" "oni"
[[3]]
[1] "Free" "entry" "in"
[4] "2" "a" "wkly"
[7] "comp" "to" "win"
[10] "FA" "Cup" "final"
[13] "tkts" "21st" "May"
[16] "2005" "Text" "FA"
[19] "to" "87121" "to"
[22] "receive" "entry" "question"
[25] "std" "txt" "rate"
[28] "T" "C" "s"
[31] "apply" "08452810075over18" "s"
Replacing with regular expressions
str_replace() does exactly what it promises, but only for the first matching occurrence:
str_replace(text, "i", "I")
[1] "Ok lar... JokIng wif u oni..."
If you want to replace all occurrences of a given pattern, use str_replace_all() instead:
str_replace_all(texts, "\\b[0-9]+\\b", "NUMBER")
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
[2] "Ok lar... Joking wif u oni..."
[3] "Free entry in NUMBER a wkly comp to win FA Cup final tkts 21st May NUMBER. Text FA to NUMBER to receive entry question(std txt rate)T&C's apply 08452810075over18's"
[4] "U dun say so early hor... U c already then say..."
[5] "Nah I don't think he goes to usf, he lives around here though"
[6] "FreeMsg Hey there darling it's been NUMBER week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £NUMBER.NUMBER to rcv"
[7] "Even my brother is not like to speak with me. They treat me like aids patent."
[8] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *NUMBER to copy your friends Callertune"
[9] "WINNER!! As a valued network customer you have been selected to receivea £NUMBER prize reward! To claim call NUMBER. Claim code KL341. Valid NUMBER hours only."
[10] "Had your mobile NUMBER months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on NUMBER"
Grouping and backreferences work, but we won’t go that far now.
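As a small taste nonetheless (our own examples): parentheses capture groups, and \\1, \\2 refer back to them in the replacement string.

```r
library(stringr)

# swap the first two words of a message
str_replace("Joking wif u", "^(\\w+) (\\w+)", "\\2 \\1")
# [1] "wif Joking u"

# wrap every run of digits in angle brackets
str_replace_all("Call 09061701461. Valid 12 hours only.", "([0-9]+)", "<\\1>")
# [1] "Call <09061701461>. Valid <12> hours only."
```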
Packages: NLP & openNLP
openNLP provides an interface to the Apache OpenNLP tools, a Java toolkit for typical NLP (natural language processing) tasks such as tokenisation, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution (https://opennlp.apache.org).
NLP has a special class for strings:
txt <- as.String(texts)
This makes it possible to access substrings using square brackets:
txt[1, 21]
Go until jurong point
Language settings
While English is available by default, you will need additional model files if you want to handle text in other languages. Pre-trained models are available here: http://opennlp.sourceforge.net/models-1.5/
You can conveniently install these from the repository at https://datacube.wu.ac.at. For example, if we wanted models for German, we’d do this:
install.packages("openNLPmodels.de", repos = "http://datacube.wu.ac.at/", type = "source")
Then, in a given openNLP function requiring a model, you’d set the parameter language to "de". For example:
Maxent_Sent_Token_Annotator(language = "de")
In case there’s more than one model available for a single component, such as the POS-Tagger, you can use the model parameter to specify which model you’d like to use:
Maxent_POS_Tag_Annotator(model = "path/to/Rlibrary/openNLPmodels.de/models/de-pos-perceptron.bin")
Note: Even if a model is available for a particular language and component, performance on your own data may be poor, e.g. because it differs greatly from the data on which the model was trained (usually specified in the model’s description). In this case, check for better models elsewhere, train a model yourself or ask a computational linguist for help!
Sentence annotation
This first step is needed to identify sentence boundaries.
We will create an object using the Maxent_Sent_Token_Annotator() function which we will then use as an argument in the annotate() function (from NLP).
sentence_annotator <- Maxent_Sent_Token_Annotator()
sentence_annotation <- annotate(txt, sentence_annotator)
sentence_annotation
 id type start end features
1 sentence 1 209
2 sentence 211 476
3 sentence 478 515
4 sentence 517 603
5 sentence 605 635
6 sentence 637 755
7 sentence 757 805
8 sentence 807 888
9 sentence 890 915
10 sentence 917 933
11 sentence 935 954
12 sentence 956 989
13 sentence 991 1063
14 sentence 1065 1109
If we check these boundaries, we can already see that it doesn’t work well with our SMS data:
txt[sentence_annotation]
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nOk lar... Joking wif u oni...\nFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."
[2] "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nU dun say so early hor... U c already then say...\nNah I don't think he goes to usf, he lives around here though\nFreeMsg Hey there darling it's been 3 week's now and no word back!"
[3] "I'd like some fun you up for it still?"
[4] "Tb ok! XxX std chgs to send, £1.50 to rcv\nEven my brother is not like to speak with me."
[5] "They treat me like aids patent."
[6] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers."
[7] "Press *9 to copy your friends Callertune\nWINNER!!"
[8] "As a valued network customer you have been selected to receivea £900 prize reward!"
[9] "To claim call 09061701461."
[10] "Claim code KL341."
[11] "Valid 12 hours only."
[12] "Had your mobile 11 months or more?"
[13] "U R entitled to Update to the latest colour mobiles with camera for Free!"
[14] "Call The Mobile Update Co FREE on 08002986030"
Line breaks were introduced by converting the vector of texts to a String object (txt). However, the sentence detector doesn’t properly recognise these as sentence boundaries. The same goes for the ellipses (“dot-dot-dot”) in several messages.
We probably need a better model for this (trained on SMS or CMC data), but if we aren’t interested in punctuation marks later on, we could also replace parts of our input (in general, we wouldn’t recommend this):
txt <- texts[1:5] |>
str_replace_all("\\.{2,}", "\\.") |> # replace several dots by single dots
str_flatten(collapse = "</sms> ") |> # combine SMS into single text
str_replace_all("([[:alnum:]])</sms>", "\\1.") |> # replace SMS endings without punctuation marks with dots
str_remove_all("</sms>") |> # clean up
as.String()
sentence_annotation <- annotate(txt, sentence_annotator)
txt[sentence_annotation]
 [1] "Go until jurong point, crazy."
[2] "Available only in bugis n great world la e buffet."
[3] "Cine there got amore wat."
[4] "Ok lar."
[5] "Joking wif u oni."
[6] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."
[7] "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's."
[8] "U dun say so early hor."
[9] "U c already then say."
[10] "Nah I don't think he goes to usf, he lives around here though"
This looks better.
If we were to do sentence boundary detection on single SMS texts, we wouldn’t have to care about the issue regarding line breaks.
In general: If you notice errors in early annotation steps, think about error propagation. In an NLP pipeline (sentence boundary detection -> tokenisation -> PoS-tagging -> dependency parsing -> …), the accuracy of later steps depends upon the accuracy of earlier steps. If a text hasn’t been properly tokenised, POS tags will often be rubbish. If PoS tags are wrong, or if sentence boundaries are wrong or missing, this will affect the resulting dependency trees – and so on. Most importantly, however, it may eventually affect your actual analysis.
Ultimately, it may be worthwhile to invest some time in improving “boring” or “solved” pre-processing and annotation steps.
Tokenisation
We can now tokenise the text. This time, we pass an additional argument to annotate(): the annotations to start with, namely sentence_annotation.
token_annotator <- Maxent_Word_Token_Annotator()
token_annotation <- annotate(txt,
token_annotator,
sentence_annotation)
token_annotation
 id type start end features
1 sentence 1 29 constituents=<<integer,7>>
2 sentence 31 80 constituents=<<integer,11>>
3 sentence 82 106 constituents=<<integer,6>>
4 sentence 108 114 constituents=<<integer,3>>
5 sentence 116 132 constituents=<<integer,5>>
6 sentence 134 200 constituents=<<integer,17>>
7 sentence 202 289 constituents=<<integer,15>>
8 sentence 291 313 constituents=<<integer,7>>
9 sentence 315 335 constituents=<<integer,6>>
10 sentence 337 397 constituents=<<integer,15>>
11 word 1 2
12 word 4 8
13 word 10 15
14 word 17 21
15 word 22 22
16 word 24 28
17 word 29 29
18 word 31 39
19 word 41 44
20 word 46 47
21 word 49 53
22 word 55 55
23 word 57 61
24 word 63 67
25 word 69 70
26 word 72 72
27 word 74 79
28 word 80 80
29 word 82 85
30 word 87 91
31 word 93 95
32 word 97 101
33 word 103 105
34 word 106 106
35 word 108 109
36 word 111 113
37 word 114 114
38 word 116 121
39 word 123 125
40 word 127 127
41 word 129 131
42 word 132 132
43 word 134 137
44 word 139 143
45 word 145 146
46 word 148 148
47 word 150 150
48 word 152 155
49 word 157 160
50 word 162 163
51 word 165 167
52 word 169 170
53 word 172 174
54 word 176 180
55 word 182 185
56 word 187 190
57 word 192 194
58 word 196 199
59 word 200 200
60 word 202 205
61 word 207 208
62 word 210 211
63 word 213 217
64 word 219 220
65 word 222 228
66 word 230 234
67 word 236 247
68 word 249 251
69 word 253 260
70 word 261 262
71 word 264 268
72 word 270 286
73 word 287 288
74 word 289 289
75 word 291 291
76 word 293 295
77 word 297 299
78 word 301 302
79 word 304 308
80 word 310 312
81 word 313 313
82 word 315 315
83 word 317 317
84 word 319 325
85 word 327 330
86 word 332 334
87 word 335 335
88 word 337 339
89 word 341 341
90 word 343 344
91 word 345 347
92 word 349 353
93 word 355 356
94 word 358 361
95 word 363 364
96 word 366 368
97 word 369 369
98 word 371 372
99 word 374 378
100 word 380 385
101 word 387 390
102 word 392 397
Let’s have a look at the actual tokens:
txt[token_annotation[token_annotation$type == "word"]] |> head(18)
 [1] "Go" "until" "jurong" "point" "," "crazy"
[7] "." "Available" "only" "in" "bugis" "n"
[13] "great" "world" "la" "e" "buffet" "."
PoS-Tagging
To add part-of-speech tags, we proceed in the same manner:
pos_tagger <- Maxent_POS_Tag_Annotator()
annotation <- annotate(txt, pos_tagger, token_annotation)
annotation
 id type start end features
1 sentence 1 29 constituents=<<integer,7>>
2 sentence 31 80 constituents=<<integer,11>>
3 sentence 82 106 constituents=<<integer,6>>
4 sentence 108 114 constituents=<<integer,3>>
5 sentence 116 132 constituents=<<integer,5>>
6 sentence 134 200 constituents=<<integer,17>>
7 sentence 202 289 constituents=<<integer,15>>
8 sentence 291 313 constituents=<<integer,7>>
9 sentence 315 335 constituents=<<integer,6>>
10 sentence 337 397 constituents=<<integer,15>>
11 word 1 2 POS=VB
12 word 4 8 POS=IN
13 word 10 15 POS=JJ
14 word 17 21 POS=NN
15 word 22 22 POS=,
16 word 24 28 POS=JJ
17 word 29 29 POS=.
18 word 31 39 POS=JJ
19 word 41 44 POS=RB
20 word 46 47 POS=IN
21 word 49 53 POS=NN
22 word 55 55 POS=RB
23 word 57 61 POS=JJ
24 word 63 67 POS=NN
25 word 69 70 POS=DT
26 word 72 72 POS=NN
27 word 74 79 POS=NN
28 word 80 80 POS=.
29 word 82 85 POS=NNP
30 word 87 91 POS=RB
31 word 93 95 POS=VBD
32 word 97 101 POS=RB
33 word 103 105 POS=NN
34 word 106 106 POS=.
35 word 108 109 POS=NN
36 word 111 113 POS=JJ
37 word 114 114 POS=.
38 word 116 121 POS=VBG
39 word 123 125 POS=IN
40 word 127 127 POS=PRP
41 word 129 131 POS=NNS
42 word 132 132 POS=.
43 word 134 137 POS=JJ
44 word 139 143 POS=NN
45 word 145 146 POS=IN
46 word 148 148 POS=CD
47 word 150 150 POS=DT
48 word 152 155 POS=RB
49 word 157 160 POS=JJ
50 word 162 163 POS=TO
51 word 165 167 POS=VB
52 word 169 170 POS=NNP
53 word 172 174 POS=NNP
54 word 176 180 POS=JJ
55 word 182 185 POS=NNS
56 word 187 190 POS=JJ
57 word 192 194 POS=NNP
58 word 196 199 POS=CD
59 word 200 200 POS=.
60 word 202 205 POS=NNP
61 word 207 208 POS=VBD
62 word 210 211 POS=TO
63 word 213 217 POS=CD
64 word 219 220 POS=TO
65 word 222 228 POS=VB
66 word 230 234 POS=NN
67 word 236 247 POS=NN
68 word 249 251 POS=IN
69 word 253 260 POS=NNP
70 word 261 262 POS=POS
71 word 264 268 POS=VB
72 word 270 286 POS=IN
73 word 287 288 POS=PRP
74 word 289 289 POS=.
75 word 291 291 POS=NNP
76 word 293 295 POS=NN
77 word 297 299 POS=VBP
78 word 301 302 POS=RB
79 word 304 308 POS=JJ
80 word 310 312 POS=NN
81 word 313 313 POS=.
82 word 315 315 POS=NNP
83 word 317 317 POS=NN
84 word 319 325 POS=RB
85 word 327 330 POS=RB
86 word 332 334 POS=VBP
87 word 335 335 POS=.
88 word 337 339 POS=NNP
89 word 341 341 POS=PRP
90 word 343 344 POS=VBP
91 word 345 347 POS=RB
92 word 349 353 POS=VB
93 word 355 356 POS=PRP
94 word 358 361 POS=VBZ
95 word 363 364 POS=TO
96 word 366 368 POS=RP
97 word 369 369 POS=,
98 word 371 372 POS=PRP
99 word 374 378 POS=VBZ
100 word 380 385 POS=RB
101 word 387 390 POS=RB
102 word 392 397 POS=IN
We can now work with the annotated text – or first transform it to another format.
tokens <- txt[annotation[annotation$type == "word"]]
# annotation$features is a list of lists which makes it a little
# difficult to handle:
# annotation$features[annotation$type == "word"]
# We can use unlist() to collapse list elements into a vector:
pos_sequence <- as.character(unlist(
annotation$features[
annotation$type == "word"
]
))
# We could also transform it to a data.frame:
# annotation |> as.data.frame() |> filter(type == "word")
# Or we could construct a tibble out of tokens and PoS tags
# (although it would probably be nice to further add sentence and
# text IDs):
tibble(ID = 1:length(tokens), Token = tokens, POS = pos_sequence)
Package: corpustools
corpustools is a package designed to manage, query and analyse tokenised text. For advanced pre-processing (e.g. part-of-speech tagging or parsing), it relies on other packages, such as spacyr, coreNLP or udpipe.
There’s a package vignette that provides a nice overview: https://cran.r-project.org/web/packages/corpustools/vignettes/corpustools.html
We’ll first create a corpus from a data.frame. Because every document needs a unique ID, we’ll have to add a new column to our SMS dataset:
smsspam_df <- data.frame(smsspam) # copy
smsspam_df$id <- sprintf("doc_%05d", 1:nrow(smsspam)) # new column
corpus <- create_tcorpus(smsspam_df,
doc_column = 'id',
text_columns = 'text',
split_sentences = TRUE,
verbose = FALSE) # suppress progress bar
corpus
tCorpus containing 108985 tokens
grouped by documents (n = 5574) and sentences (n = 11467)
contains:
- 4 columns in $tokens: doc_id, sentence, token_id, token
- 2 columns in $meta: doc_id, y
Let’s look at the tokens:
head(corpus$tokens)
And the metadata:
head(corpus$meta)
We can also import pre-processed tokens. We’ll use the sample data corenlp_tokens to demonstrate this.
head(corenlp_tokens)
tc <- tokens_to_tcorpus(corenlp_tokens,
doc_col = 'doc_id',
sentence_col = 'sentence',
token_id_col = 'id')
tc
tCorpus containing 36 tokens
grouped by documents (n = 1) and sentences (n = 6)
contains:
- 13 columns in $tokens: doc_id, sentence, token_id, token, lemma, CharacterOffsetBegin, CharacterOffsetEnd, POS, NER, Speaker, parent, relation, pos1
- 1 column in $meta: doc_id
Subsetting
We can use subset() to filter our corpus (see ?subset.tCorpus).
Use the parameter subset to specify which rows to keep in the tokens data:
sent2 <- subset(corpus, subset = sentence == 2)
sent2
tCorpus containing 27648 tokens
grouped by documents (n = 2996) and sentences (n = 2996)
contains:
- 4 columns in $tokens: doc_id, sentence, token_id, token
- 2 columns in $meta: doc_id, y
Use the parameter subset_meta to specify which documents to keep:
doc50 <- subset(corpus, subset_meta = doc_id == "doc_00050")
doc50$tokens
first_spam <- subset(corpus,
subset = token_id == 1,
subset_meta = y == "spam")
first_spam$tokens |> select(doc_id, token)
Pre-processing
An object of the class tCorpus (like our corpus) has quite a few class methods we can use (have a look at str(corpus)).
Keep in mind that these class methods usually directly modify the corpus object. If you want to keep the original object, use tCorpus$copy() first:
corpus_old <- corpus$copy()
Arguably the most important class method is tCorpus$preprocess() which allows us to apply typical pre-processing steps to the whole corpus:
corpus$preprocess(column = 'token',
new_column = 'token2',
lowercase = TRUE,
remove_punctuation = FALSE,
remove_stopwords = FALSE,
remove_numbers = FALSE,
use_stemming = FALSE,
language = "english")
corpus$tokens
Deduplication
We can remove identical or very similar documents from our corpus:
nrow(corpus$meta)
[1] 5574
corpus$deduplicate(feature='token2',
similarity = 0.95, # very similar documents
print_duplicates = FALSE) # set to TRUE if you want the IDs of removed documents
as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
Deleting 521 duplicates
nrow(corpus$meta)
[1] 5053
Document feature matrix
A document-term matrix is a matrix where each row represents a single document and each column represents a feature (e.g. a token). The values in the matrix represent the frequency of a given term in a given document.
A document-term matrix can be used to tell us something about the content or style of a document. This information can then be used to retrieve relevant documents, to group similar documents by topic or to compare the style of different authors.
Depending on the task at hand, the procedure will differ, but the basic principle remains the same.
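As a minimal base-R illustration of the principle (with two made-up toy documents), a document-term matrix is just a cross-tabulation of document IDs and tokens:

```r
# Two toy documents (invented for illustration)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog sat")

# Tokenise by whitespace and build a long (doc, token) table
tokens <- strsplit(docs, " ")
long <- data.frame(
  doc   = rep(names(tokens), lengths(tokens)),
  token = unlist(tokens)
)

# The document-term matrix is a simple cross-tabulation:
# one row per document, one column per term, cells = frequencies
dtm <- table(long$doc, long$token)
dtm
```

Real packages store this as a sparse matrix, but the underlying counts are exactly these.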
In many use cases, it makes sense to remove certain words (e.g. function words or very infrequent words) before constructing a document-term matrix. (On the other hand, sometimes function words can be very interesting, for example in stylometry. So think before thinning out your corpus!)
corpus$preprocess(column = 'token',
new_column = 'feature',
remove_stopwords = TRUE,
remove_numbers = TRUE,
use_stemming = TRUE,
min_docfreq = 5)
dfm <- get_dfm(corpus, 'feature')
dfm
Document-feature matrix of: 5,053 documents, 1,384 features (99.50% sparse) and 1 docvar.
features
docs 10am 10p 12hrs 150ppm 150p 1st 2land 2mrw 2nite 2day
doc_00001 0 0 0 0 0 0 0 0 0 0
doc_00002 0 0 0 0 0 0 0 0 0 0
doc_00003 0 0 0 0 0 0 0 0 0 0
doc_00004 0 0 0 0 0 0 0 0 0 0
doc_00005 0 0 0 0 0 0 0 0 0 0
doc_00006 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 5,047 more documents, reached max_nfeat ... 1,374 more features ]
You can see that most values in the matrix are zero. A matrix like this is called a sparse matrix. The percentage of zero-valued elements is often called sparsity – our matrix here is 99.5% sparse, so only 0.5% of elements are non-zero. Specialised algorithms and data structures can take advantage of sparsity, while “regular” ones used on dense matrices can be slow and inefficient as they waste time and memory on all those zeros.
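Sparsity is simply the fraction of zero-valued cells, which we can check by hand on a toy matrix (values invented):

```r
# A small toy matrix with mostly zeros
m <- matrix(0, nrow = 4, ncol = 5)
m[1, 2] <- 3
m[3, 5] <- 1

# mean() of a logical matrix gives the fraction of TRUE cells
sparsity <- mean(m == 0)
sparsity  # 18 of 20 cells are zero: 0.9
```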
The raw frequencies seen above can also be weighted differently. The best procedure depends on what you want to do!
A common approach is term frequency–inverse document frequency (tf-idf). Term frequency is what we’ve got so far: how often a token (or term) occurs in a document. (A number of adjustments is possible, e.g. for document length.)
Document frequency, on the other hand, is how many documents contain a certain token (term) – regardless of how often it occurs within individual documents. Inverse document frequency divides the total number of documents by this document frequency (N / df), so the smaller this quotient, the more common the token across documents. We usually take the natural logarithm of this quotient: idf = log(N / df).
Finally, tf-idf is the product of term frequency and inverse document frequency. Idf therefore functions as a weight for the term frequency: the less common a token is across documents, the more important it is in a document in which it occurs. And vice versa, the more common a token generally is (think of words like is, and, the, people etc.), the less important it is in a single document, even if it occurs very frequently.
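To make the arithmetic concrete, here is a small base-R sketch of the textbook formula tf-idf = tf * log(N / df), using an invented three-document corpus (the helper names df and tf_idf are our own):

```r
docs <- list(
  d1 = c("free", "prize", "call"),
  d2 = c("call", "me", "later"),
  d3 = c("see", "you", "later")
)
N <- length(docs)

# document frequency: in how many documents does a term occur?
df <- function(term) sum(vapply(docs, function(d) term %in% d, logical(1)))

# tf-idf: raw term frequency in one document * log(N / df)
tf_idf <- function(term, doc) {
  tf <- sum(docs[[doc]] == term)
  tf * log(N / df(term))
}

tf_idf("prize", "d1")  # occurs in 1 of 3 documents: 1 * log(3)
tf_idf("later", "d2")  # occurs in 2 of 3 documents: 1 * log(3/2), so a lower weight
```

A term occurring in every document would get log(3/3) = 0, i.e. no weight at all, which is exactly the intended behaviour for words like the or and.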
dfm_weighted <- get_dfm(corpus, 'feature', weight = 'tfidf')
dfm_weighted
Document-feature matrix of: 5,053 documents, 1,384 features (99.50% sparse) and 1 docvar.
features
docs 10am 10p 12hrs 150ppm 150p 1st 2land 2mrw 2nite 2day
doc_00001 0 0 0 0 0 0 0 0 0 0
doc_00002 0 0 0 0 0 0 0 0 0 0
doc_00003 0 0 0 0 0 0 0 0 0 0
doc_00004 0 0 0 0 0 0 0 0 0 0
doc_00005 0 0 0 0 0 0 0 0 0 0
doc_00006 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 5,047 more documents, reached max_nfeat ... 1,374 more features ]
When tidytext is loaded, you can use tidy() to convert a document-feature matrix to a tibble. This might make it easier to get a sense of the data.
dfm_weighted |> tidy() |> arrange(desc(count))
This should probably tell us that more filtering before creating the dtm and/or better pre-processing and tokenisation might be a good idea. :)
Search the corpus
query_result <- search_features(corpus,
feature = "token",
query = c('call', 'phone'))
table(as.character(query_result$hits$feature))
call Call CALL phone Phone PHONE
296 118 21 94 7 4
Visualise results
queries <- data.frame(label = c('call', 'email', 'meet'),
query = c('call* OR phone',
'email OR e-mail',
'meet* OR see*'))
hits <- search_features(corpus,
query = queries$query,
code = queries$label)
count_tcorpus(corpus, hits = hits)
category_hits <- count_tcorpus(corpus,
hits,
meta_cols = 'y',
wide = FALSE)
ggplot(category_hits, aes(x = y, y = count, fill = code)) +
geom_col(position = "dodge") # geom_col() = geom_bar(stat = "identity")
Display results
If we also want to see results in context, we can use browse_hits() to create a static HTML page of our hits, or get_kwic() to get a data.frame with a KWIC (keyword in context) column.
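Under the hood, a KWIC display just slices a window of context around each match. A minimal character-based sketch in base R (the function name kwic_chr is our own invention, not part of corpustools):

```r
# Show each match of a pattern with `width` characters of context
kwic_chr <- function(text, pattern, width = 10) {
  m <- gregexpr(pattern, text)[[1]]
  if (m[1] == -1) return(character(0))
  ends <- m + attr(m, "match.length") - 1
  vapply(seq_along(m), function(i) {
    left  <- substr(text, max(1, m[i] - width), m[i] - 1)
    hit   <- substr(text, m[i], ends[i])
    right <- substr(text, ends[i] + 1, ends[i] + width)
    sprintf("%s[%s]%s", left, hit, right)
  }, character(1))
}

kwic_chr("Please call me, or call back later", "call")
```

get_kwic() works token-wise rather than character-wise and handles multiple documents, but the idea is the same.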
url = browse_hits(corpus, hits, view = TRUE)
get_kwic(corpus, query = "call*", n = 2)
Package: tidytext
tidytext is designed to work with tidy text data, meaning one token per row. Some of the structures we’ve seen above don’t conform to this, so tidytext offers the tidy() function to tidy data from other packages and to convert tidy data into other formats expected by these packages.
Create a tibble out of a tCorpus object:
sms_tidy <- as_tibble(corpus$tokens) |>
inner_join(corpus$meta, by = 'doc_id') |>
mutate(word = token2)
sms_tidy
Looking at the feature column, we can see that the stemming we did earlier looks pretty horrible.
Alternatively, we could apply the tidytext function unnest_tokens() to the raw texts to get a similar result:
smsspam |>
as_tibble() |>
mutate(text_id = 1:n()) |> # add text ids
unnest_tokens(output = sentence, input = text, token = "sentences") |> # split sentences
group_by(text_id) |> # group by text for next step
mutate(sentence_id = 1:n()) |> # add sentence ids
ungroup() |> # no more grouping for next step
unnest_tokens(output = token, input = sentence, token = "words", drop = FALSE, strip_punct = FALSE) |> # tokenise each sentence, keep other columns
group_by(text_id, sentence_id) |> # group for next step (could also just group by text id)
mutate(token_id = 1:n()) |>
ungroup()
Counting words
sms_tidy |> count(word, sort = TRUE)
Removing stop words
Stop word lists are far from ideal, but sometimes they are enough. get_stopwords() gives us a lexicon of words commonly excluded from further analysis. We can then do an anti_join() to remove them from our data.
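In base R terms, an anti-join against a stop-word list is just negated %in% matching (both vectors below are invented for illustration):

```r
words     <- c("free", "entry", "in", "a", "weekly", "comp")
stopwords <- c("in", "a", "the", "to", "of")

# keep only the words that do NOT appear in the stop-word list
kept <- words[!(words %in% stopwords)]
kept  # "free" "entry" "weekly" "comp"
```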
sms_clean <- sms_tidy |>
anti_join(tidytext::get_stopwords("en"))
Joining, by = "word"
nrow(sms_clean) # number of rows now
[1] 66290
nrow(sms_tidy) # number of rows before
[1] 96537
sms_clean |> count(word, sort = TRUE)
We’ve still got punctuation and very short words in there, so let’s remove all that as well:
sms_clean <- sms_clean |> filter(str_length(word) > 2)
sms_clean |> count(word, sort = TRUE)
Word clouds
Let’s give wordcloud a try this time.
sms_clean |> count(word) |> with(wordcloud(word, n, max.words = 50))
Of course, ggwordcloud works just as well – and allows us to group the words by classification (spam or not).
set.seed(1)
sms_clean |>
group_by(y) |>
count(word) |>
arrange(y, desc(n)) |>
summarise(head(across(), 100)) |>
ungroup() |>
mutate(angle = 90 * sample(c(0, 1), n(),
replace = TRUE, prob = c(60, 40))) |>
ggplot(aes(label = word, size = n,
colour = y, x = y,
angle = angle)) +
geom_text_wordcloud(area_corr = TRUE) +
scale_size_area(max_size = 12) +
labs(title = "Most frequent words", x = "Classification") +
theme_minimal() +
theme(panel.grid.major = element_blank())
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.
Document-feature matrix
Instead of using the document-term matrix from above, we can also easily create a new one without using the corpustools package.
tidy_dtm <- sms_clean |>
group_by(doc_id, word) |>
summarise(count = n()) |>
ungroup() |>
arrange(doc_id, word)
`summarise()` has grouped output by 'doc_id'. You can override using the
`.groups` argument.
rare_terms <- sms_clean |>
count(word) |>
filter(n < 4) |>
arrange(desc(n))
tidy_dtm <- tidy_dtm |>
anti_join(rare_terms) # remove rare terms
Joining, by = "word"
tidy_dtm
To get a proper matrix, we can convert a tibble like this using cast_dfm(), specifying which columns in our data contain document identifiers, terms and frequencies.
tidy_dtm |>
mutate(word = as.character(word)) |> # next function doesn't like factors
cast_dfm(doc_id, word, count)
Document-feature matrix of: 4,957 documents, 1,740 features (99.65% sparse) and 0 docvars.
features
docs available bugis cine crazy got great world wat point joking
doc_00001 1 1 1 1 1 1 1 1 1 0
doc_00002 0 0 0 0 0 0 0 0 0 1
doc_00003 0 0 0 0 0 0 0 0 0 0
doc_00004 0 0 0 0 0 0 0 0 0 0
doc_00005 0 0 0 0 0 0 0 0 0 0
doc_00006 0 0 0 0 0 0 0 0 0 0
[ reached max_ndoc ... 4,951 more documents, reached max_nfeat ... 1,730 more features ]
We can also use cast_dtm() to get a document-feature matrix in the DocumentTermMatrix format from the tm (text mining) package.
tidy_dtm |>
mutate(word = as.character(word)) |> # next function doesn't like factors
cast_dtm(doc_id, word, count)
<<DocumentTermMatrix (documents: 4957, terms: 1740)>>
Non-/sparse entries: 30542/8594638
Sparsity : 100%
Maximal term length: 17
Weighting : term frequency (tf)
Both functions have the optional parameter weighting […]
For tf-idf, there’s also the function bind_tf_idf() which can be applied to a tidy tibble with one row per token (or term), per document (such as tidy_dtm):
tidy_dtm |> bind_tf_idf(word, doc_id, count)
Topic modelling
Topic modelling uses Latent Dirichlet Allocation (LDA) to extract a number of topics from a collection of texts.
The required input for LDA() is a document-term matrix in the DocumentTermMatrix format from the tm (text mining) package.
sms_lda <- tidy_dtm |>
mutate(word = as.character(word)) |> # next function doesn't like factors
cast_dtm(doc_id, word, count) |>
LDA(k = 4)
topics <- tidy(sms_lda, matrix = "beta")
topics
Let’s see what the top 20 terms for each topic are:
top_terms <- topics |>
group_by(topic) |>
slice_max(beta, n = 20) |>
ungroup() |>
arrange(topic, -beta)
top_terms |>
mutate(term = reorder_within(term, beta, topic)) |>
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 4) +
scale_y_reordered()Honestly, we can’t make much sense of this. We could filter more to get a better input for the LDA, but it may also be that topic modelling just doesn’t work well for such short messages.
We’ll give it another try with some CMC data from Twitter and Reddit. Since we’ve got part-of-speech tags in this case, we’ll only keep certain content words for our analysis – no need for stop word lists!
cmc <- read_csv2("../data/cmc-sample.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 104128 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (7): token, pos, lemma, lemma_ab, lemma_nd, comment_ab, source
dbl (1): id
lgl (1): comment_nd
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cmc$textid <- str_extract(cmc$token, "(?<=<text_id )[[:alnum:]]+")
cmc <- cmc |> fill(textid, .direction = "downup")
# new column: year/month
cmc$year_month <- str_extract(cmc$token, "(?<=<text_ym )[0-9]+")
cmc <- cmc |> fill(year_month, .direction = "downup")
# new column: sentence id
cmc <- cmc |> mutate(sid = cumsum(str_detect(token, "<s>")))
# no more XML tags
cmc <- cmc |> filter(!is.na(pos))
# new column: extract year from year_month
cmc$year <- str_extract(cmc$year_month, "[0-9]{4}")
# replace HTML entities
cmc <- cmc |> mutate(token = str_replace(token, "&#39;", "'"),
token = str_replace(token, "&quot;", '"'),
token = str_replace(token, "&amp;", "&"),
token = str_replace(token, "&lt;", "<"),
token = str_replace(token, "&gt;", ">"),
token = str_replace(token, "&apos;", "'"),
token = str_replace(token, "&#37;", "%"))
tags <- c("ADJD", "ADJA", "NN", "NE", "TRUNC", "HST") # c("ADJD", "ADJA", "NN", "NE", "TRUNC", "VVFIN", "VVIMP", "VVINF", "VVIZU", "VVPP", "HST")
cmc_content <- cmc |>
filter(pos %in% tags,
str_length(lemma) > 1) |>
select(source, textid, sid, tid = id, token, lemma = lemma_ab,
pos, year_month, year)
cmc_content
cmc_dtm <- cmc_content |>
group_by(textid) |>
count(lemma) |>
ungroup() |>
arrange(textid, lemma)
cmc_rare_terms <- cmc_content |>
count(lemma) |>
filter(n < 4) |>
arrange(desc(n))
cmc_dtm <- cmc_dtm |>
anti_join(cmc_rare_terms, by = "lemma")
cmc_dtm <- cmc_dtm |> bind_tf_idf(lemma, textid, n)
cmc_dtm <- cmc_dtm |> filter(tf < .5, tf_idf >= .3)
cmc_dtm
cmc_lda <- cmc_dtm |>
cast_dtm(textid, lemma, n) |>
LDA(k = 12, control = list(nstart = 50)) # 50 repeated runs with random initialisations
# LDA(k = 12, method = "Gibbs", control = list(seed = 42, # setting a seed makes the result reproducible
# iter = 3000,
# thin = 10,
# burnin = 1000))
topics <- tidy(cmc_lda, matrix = "beta")
topics
top_terms <- topics |>
group_by(topic) |>
slice_max(beta, n = 20) |>
ungroup() |>
arrange(topic, -beta)
top_terms |>
mutate(term = reorder_within(term, beta, topic)) |>
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
scale_y_reordered() +
labs(x = expression(beta), y = "Lemma")In contrast to the topics from the SMS corpus, these topics are at least interpretable and partly coincide with our own observations (e.g. Brexit being a popular topic).
However, the topics are relatively unstable, so that you often get somewhat different topics in different runs. A higher value of nstart (for more repeated runs) may increase the topic stability, but will also greatly increase the execution time.
Topic modelling also works much better for collections of longer documents.
Which lemmas in individual documents are assigned to which topics?
augment(cmc_lda, cmc_dtm |> rename(document = textid, term = lemma)) |>
select(document, term, .topic, everything())
We can also look at the estimated proportions of terms from each document belonging to each topic:
cmc_gamma <- tidy(cmc_lda, matrix = "gamma") |>
arrange(document, desc(gamma))
cmc_gamma
Are there differences between Reddit and Twitter regarding the estimated proportions?
document_source <- cmc |>
select(textid, source, year) |>
rename(document = textid) |>
distinct()
cmc_gamma <- cmc_gamma |>
left_join(document_source)
Joining, by = "document"
cmc_gamma |>
ggplot(aes(x = factor(topic), y = gamma, colour = source)) +
geom_boxplot(outlier.alpha = .1) +
labs(x = "Topic", y = expression(gamma), colour = "Source")
Sentiment analysis
Sentiment analysis, or opinion mining, in its simplest form is used to assess whether a text (or a section of a text) is positive or negative (e.g. product reviews). More sophisticated forms of sentiment analysis are used to assess a wider range of emotional content (surprise, anger, love, disgust, …).
An easy (but crude) approach to sentiment analysis would be to use a sentiment lexicon which assigns sentiment labels to specific tokens (e.g. idiotic => negative sentiment).
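The core of this dictionary approach fits in a few lines of base R. The miniature lexicon and the helper sentiment_score() below are invented for illustration, not taken from any real sentiment lexicon:

```r
# A miniature sentiment lexicon: +1 positive, -1 negative
lexicon <- c(great = 1, free = 1, win = 1,
             idiotic = -1, urgent = -1, scam = -1)

# Score a text as the sum of the scores of its known tokens;
# tokens not in the lexicon contribute nothing (na.rm = TRUE)
sentiment_score <- function(text) {
  tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  sum(lexicon[tokens], na.rm = TRUE)
}

sentiment_score("Great news: you win a free prize!")  # 3
sentiment_score("This idiotic scam looks urgent")     # -3
```

Real lexicons like bing (used below) are much larger, but the scoring logic is the same.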
sentiment_scores <- tidytext::get_sentiments(lexicon = "bing")
positive <- sentiment_scores |> filter(sentiment == "positive")
negative <- sentiment_scores |> filter(sentiment == "negative")
sms_tidy |>
filter(y == "ham") |>
semi_join(positive) |>
count(word, sort = TRUE)
Joining, by = "word"
sms_tidy |>
filter(y == "spam") |>
semi_join(positive) |>
count(word, sort = TRUE)
Joining, by = "word"
sms_tidy |>
inner_join(sentiment_scores) |>
ggplot(aes(x = y, fill = sentiment)) +
geom_bar(position = "dodge")
Joining, by = "word"
Further reading and exercises
- DataCamp course which covers document term matrices, TF-IDF, topic modelling and more: https://campus.datacamp.com/courses/introduction-to-natural-language-processing-in-r
- Arnold/Tilton (2015) has a chapter on NLP in R and a chapter on text analysis which demonstrate some DH applications.
- https://www.tidytextmining.com also covers stuff like TF-IDF and topic models in more detail.